External Sampling

Authors

  • Alexandr Andoni
  • Piotr Indyk
  • Krzysztof Onak
  • Ronitt Rubinfeld
Abstract

We initiate the study of sublinear-time algorithms in the external memory model [14]. In this model, the data is stored in blocks of a certain size B, and the algorithm is charged a unit cost for each block access. This model is well-studied, since it reflects the computational issues that arise when the (massive) input is stored on a disk. Since each block access operates on B data elements in parallel, many problems have external memory algorithms whose number of block accesses is only a small fraction (e.g. 1/B) of their main memory complexity. However, to the best of our knowledge, no such reduction in complexity is known for any sublinear-time algorithm. One plausible explanation is that the vast majority of sublinear-time algorithms use random sampling and thus exhibit no locality of reference. This state of affairs is unfortunate, since both sublinear-time algorithms and the external memory model are important approaches to dealing with massive data sets, and ideally they should be combined to achieve the best performance. In this paper we show that such a combination is indeed possible. In particular, we consider three well-studied problems: testing of distinctness, uniformity and identity of an empirical distribution induced by data. For these problems we show random-sampling-based algorithms whose number of block accesses is up to a factor of 1/√B smaller than the main memory complexity of those problems. We also show that this improvement is optimal for those problems. Since these problems are natural primitives for a number of sampling-based algorithms for other problems, our tools improve the external memory complexity of other problems as well.
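The cost model described in the abstract can be illustrated with a small sketch. This is not the algorithm from the paper: it is a naive distinctness check that samples whole blocks and counts block reads, written only to show how cost is charged per block rather than per element. The names `BlockedArray`, `read_block`, and `looks_distinct` are hypothetical and introduced here for illustration.

```python
import random


class BlockedArray:
    """Hypothetical stand-in for data on disk: reads happen one block at a time."""

    def __init__(self, data, block_size):
        self.data = list(data)
        self.block_size = block_size
        self.block_reads = 0  # cost charged in the external memory model

    def num_blocks(self):
        # Ceiling division: the last block may be partially filled.
        return (len(self.data) + self.block_size - 1) // self.block_size

    def read_block(self, i):
        # One unit of cost, regardless of how many elements the block holds.
        self.block_reads += 1
        start = i * self.block_size
        return self.data[start:start + self.block_size]


def looks_distinct(store, blocks_to_sample, rng):
    """Naive check (an assumption, not the paper's tester).

    Reads randomly chosen blocks and returns False as soon as a repeated
    value is seen; returns True if no repeat shows up in the sample.
    """
    k = min(blocks_to_sample, store.num_blocks())
    seen = set()
    for i in rng.sample(range(store.num_blocks()), k):
        for x in store.read_block(i):
            if x in seen:
                return False
            seen.add(x)
    return True


if __name__ == "__main__":
    store = BlockedArray(range(1000), block_size=10)
    print(looks_distinct(store, blocks_to_sample=20, rng=random.Random(0)))
    print("block reads:", store.block_reads)
```

In this sketch, inspecting 20 blocks of size 10 touches 200 elements at a cost of 20 block reads. Elements that share a block are not independent samples, so the guarantees of a uniform-sampling analysis do not carry over automatically.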


Similar resources

Ethical Dilemmas in Sampling

This paper focuses on sampling as a nexus of ethical dilemmas experienced by social workers and other applied empirical researchers. It is argued here that social workers and other applied researchers have an ethical obligation to construct the smallest representative samples possible. Although random sampling is considered by many researchers as the gold standard methodological procedure for m...


An Importance Sampling Scheme on Dual Factor Graphs. I. Models in a Strong External Field

We propose an importance sampling scheme to estimate the partition function of the two-dimensional ferromagnetic Ising model and the two-dimensional ferromagnetic q-state Potts model, both in the presence of an external magnetic field. The proposed scheme operates on the dual Forney factor graph and is capable of efficiently computing an estimate of the partition function under a wide range of ...


Stimulus Sampling and Social Psychological Experimentation

The authors discuss the problem with failing to sample stimuli in social psychological experimentation. Although commonly construed as an issue for external validity, the authors emphasize how failure to sample stimuli also can threaten construct validity. They note some circumstances where the need for stimulus sampling is less obvious and more obvious, and they discuss some well-known cogniti...


Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory

In this paper we take a state-of-the-art model for distributed word representation that explicitly factorizes the positive pointwise mutual information (PPMI) matrix using window sampling and negative sampling and address two of its shortcomings. We improve syntactic performance by using positional contexts, and solve the need to store the PPMI matrix in memory by working on aggregate data in e...


Entropic sampling dynamics of the globally-coupled kinetic Ising model

The entropic sampling dynamics based on the reversible information transfer to and from the environment is applied to the globally coupled Ising model in the presence of an oscillating magnetic field. When the driving frequency is low enough, coherence between the magnetization and the external magnetic field is observed; such behavior tends to weaken with the system size. The time-scale matchi...



Publication date: 2009